We encountered issues with our dataset sourced from Kaggle, namely
counterintuitive findings that raised concerns about the accuracy of
certain attributes. For instance, the dataset suggests that experiencing
chest pain during exercise correlates with a lower likelihood of heart
disease, while showing no chest pain increases the risk of heart
disease, an interpretation that contradicts established medical
understanding (Hamada, 2025). This anomaly suggests that the target
variable in our Kaggle dataset may have been reversed by the uploader,
with a value of 0 actually representing a diseased heart and 1 a healthy
heart, rather than the original convention where 0 = healthy and 1 =
diseased. The dataset also contains multiple duplicate rows, raising
questions about the reliability of the source (Kalsi, 2024). This is
problematic: there are 723 duplicated entries, which could introduce
bias and subsequently degrade model performance (Rehman, 2023).
Moreover, there are incorrect entries, perhaps reflective of “fat
finger” errors when transcribing data from the original data source
(Mahan, 2022). For example, one predictor should only take the values
0-2, according to the attribute information provided by Kaggle, yet 410
entries contain the value 3, and what 3 stands for is unclear.

Thus, while the coefficients discussed in our report remain valid in terms of magnitude and relative importance, their directional interpretation should be considered in reverse.
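For reference, a minimal sketch of how these issues could be corrected once the data are loaded is shown below; we do not apply this in the analysis that follows, and it assumes the uploader did indeed reverse the labels.
# Sketch only (not applied in our analysis): drop exact duplicate rows and flip
# the response back to the conventional coding (0 = healthy, 1 = diseased)
library(dplyr)
Heart_cleaned <- Heart %>%
distinct() %>% # removes the duplicated rows noted above
mutate(target = 1 - target) # reverses the 0/1 coding of the response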
We begin by loading all packages necessary for this document to run. The additional setup code reduces error output and helps the packages work well together.
We also define the current environment so that this document is self-sufficient in retrieving the necessary files, such as the dataset, regardless of which machine it runs on.
# Get and change the current working directory in order to retrieve the
# `data_path` and `predictions_path` (requires the rstudioapi package)
setPath <- dirname(getSourceEditorContext()$path)
setwd(setPath)
data_path <- paste0(getwd(), "/../data/")
predictions_path <- paste0(getwd(), "/../predictions/")
Heart <- read.csv(paste0(data_path, "Heart.csv"))
To ensure that the data will be read by R correctly, we do some manual preprocessing, which involves renaming the response column, reordering columns, and recoding factor variables. A preview of the first 6 rows of the data is shown below.
# Copy the `target` column to `y` (need unfactored `target` for gradient descent)
Heart$y <- Heart$target
# Move the last column to the front so that the cross-validation works
Heart <- Heart[, c(ncol(Heart), 1:(ncol(Heart) - 1))]
# Change `y` to be factor so that it is recognised as classification
Heart_non_fctr <- Heart
Heart$y <- factor(Heart$y)
# Recode `sex` for better interpretability
Heart <- Heart %>% mutate(sex_factor = recode(sex, "0" = "Female", "1" = "Male"))
# Get number of predictors/features
n_preds <- sum(names(Heart) != "y")
head(Heart)
Finally, we split the dataset into training and
testing subsets using tidymodels'
initial_split function, which partitions the data such that
70% of the observations are allocated to the training set and the
remaining 30% to the testing set.
Heart_split <- initial_split(Heart, prop = 0.7)
Heart_train <- Heart_split %>% training()
Heart_test <- Heart_split %>% testing()
# The validation set used throughout this document is the same 30% holdout as the test set
Heart_valid <- Heart_split %>% testing()
We start by taking a look at the joint distribution of predictors
against each other and the outcome, y.
Heart[,1:5] |>
ggpairs(progress = F)
for (i in seq(2,n_preds,3)) {
plot <- ggpairs(
Heart[, c(i:(min(n_preds,i + 2)), 1)],
progress = FALSE
) +
theme_minimal()
print(plot)
}
Note in the above that the distributions of most of the integer
variables appear to follow a discrete Gaussian/bell-curve shape within
each class (with the exception of
trestbps (resting blood pressure)).
Moreover, the distributions of both integer and factor variables can be distinct conditional on each class. Where the distributions are conditionally Gaussian, this implies that the mean and covariance differ between the classes, possibly motivating Quadratic Discriminant Analysis (QDA).
However, the conditional distributions (given y) are not precisely normal, limiting the possible usefulness of QDA as a model fitting procedure.
Naive Bayes also appears unsuitable, as we often observe high correlations between the features/predictors across both classes, and this is likely to extend to high correlations within each class as well. Therefore, we cannot make the necessary argument that the predictors are independent within a given class.
Knowing the class balance in our data will help explain our results and give insight into the strengths and limitations of these data. As shown below, the classes appear roughly balanced.
# Count the outcome classes and compute their proportions across the full dataset
outcome_counts_fold <- Heart |> count(y)
outcome_counts_fold$proportion <- outcome_counts_fold$n / sum(outcome_counts_fold$n)
const_yhat_success <- outcome_counts_fold$proportion[[2]]
outcome_counts_fold
const_yhat_success
## [1] 0.5131707
For each model fitting procedure, we create a grid of tuning parameters and model inputs (including a choice of predictors and functional forms where applicable).
We then iteratively fit all of these specifications and calculate the validation/cross-validation error, usually choosing the “best” model as the one with the highest accuracy, or as the simplest specification with a high enough accuracy.
In our classification setting, the error measure is the misclassification rate.
We then store this specification (all the inputs we need to reproduce the fit) and its accuracy in an aggregated table, repeating this for all model fitting procedures.
In other words, each row corresponds to a model fitting procedure (e.g. a generalized additive model (GAM), a tree, or ordinary least squares (OLS)), with each column giving some parameter or information about the specification that achieved the best accuracy for that procedure.
The table, generated by the function below, can store any number of tuning parameters and use any accuracy measure; in this case we use the misclassification (misclass) rate and 3 tuning parameters.
In this table, sub-models, functional forms and combinations of predictors are also counted as tuning parameters, purely for maintainability, although this is not strictly accurate.
gen_models_df <- function(len=10, accuracy_measures = c("min_mse")) {
df = data.frame(
model_type = character(len),
model_name = character(len)
)
for (measure in accuracy_measures) {
df[[measure]] <- numeric(len)
}
#tp stands for tuning parameter
df$tp1_name = character(len)
df$tp2_name = character(len)
df$tp3_name = character(len)
df$tp1_value = numeric(len)
df$tp2_value = numeric(len)
df$tp3_value = numeric(len)
return(df)
}
Heart_models <- gen_models_df(accuracy_measures = c("misclass_rate"))
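To make the structure concrete, here is how a completed row might look once one procedure has been tuned (the values below are purely illustrative, not results).
# Purely illustrative: fill in the first row as if a pruned classification tree had been tuned
Heart_models[1, c("model_type", "model_name", "misclass_rate", "tp1_name", "tp1_value")] <-
list("tree", "pruned classification tree", 0.15, "maxdepth", 4)
Heart_models[1, ]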
We define a set of functions to select the “best” tuning parameters for different models, often defined as the simplest model past a threshold level of prediction accuracy.
The helper below takes predictions as input and calculates the validation misclassification rate (or, optionally, the mean squared error (MSE)).
calculate_error <- function(predictions, true_values, classify) {
if (classify) {
return(mean(predictions != (as.numeric(true_values) - 1)))
} else {
return(mean((predictions - true_values)^2))
}
}
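As a quick sanity check of this helper, two wrong predictions out of four give a rate of 0.5.
# Quick sanity check: two of four predictions are wrong, so the misclassification rate is 0.5
calculate_error(predictions = c(1, 0, 1, 1),
true_values = factor(c(1, 0, 0, 0)),
classify = TRUE) # returns 0.5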
We use tidymodels syntax for efficiency and modularity across
different fitting procedures.
A helper function first converts the model specification (including the
fitting procedure and the engine) into tidymodels' expected
format.
# Function to get model specification
get_model_spec <- function(model_fitting_procedure, engine, tuning_params) {
if (model_fitting_procedure == "tree") {
model_spec_updated <- decision_tree(
mode = "classification",
cost_complexity = tuning_params$cp,
tree_depth = tuning_params$maxdepth,
min_n = tuning_params$min_n
) %>%
set_engine(engine)
} else if (model_fitting_procedure == "random_forest") {
model_spec_updated <- rand_forest(
mtry = tuning_params$mtry,
trees = tuning_params$trees,
mode = "classification"
) %>%
set_engine(engine, probability = TRUE)
} else if (model_fitting_procedure == "boosting") {
model_spec_updated <- boost_tree(mode = "classification") %>%
set_engine(engine)
} else if (model_fitting_procedure == "logit") {
model_spec_updated <- logistic_reg(mode = "classification") %>%
set_engine(engine, family = "binomial")
} else {
stop("Invalid model fitting procedure. Choose from 'tree', 'random_forest', 'boosting', or 'logit'.")
}
return(model_spec_updated)
}
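For instance, assuming the tidymodels/parsnip packages are loaded, a shallow classification tree specification can be requested as follows.
# Example call: a shallow rpart classification tree specification
example_spec <- get_model_spec(
model_fitting_procedure = "tree",
engine = "rpart",
tuning_params = data.frame(cp = 0.001, maxdepth = 4, min_n = 5)
)
example_spec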
# Function to get the accuracy for each model specification corresponding to the tuning parameters
grid_validate_tidy <- function(train_data, valid_data, tuning_grid, model_fitting_procedure, engine) {
# Initialize empty data frames to store results and predictions
accuracy_df <- tuning_grid
all_preds_df <- data.frame()
model_fits <- list() # Initialize the list to store model fits
# Iterate over each combination of hyperparameters in the tuning grid
for(i in 1:nrow(tuning_grid)) {
# Extract current tuning parameters
tuning_params <- tuning_grid[i, ]
# Get the model specification dynamically
model_spec_updated <- get_model_spec(model_fitting_procedure, engine, tuning_params)
# Create a workflow
current_wf <- workflow() %>%
add_model(model_spec_updated) %>%
add_formula(as.formula(tuning_params$formula))
# Fit the model on the training data
model_fit <- fit(current_wf, data = train_data)
# Store the model fit in the list
model_fits[[i]] <- model_fit # Index the model fit by i
# Predict probabilities on the validation set
prob_predictions <- predict(model_fit, valid_data, type = "prob")$.pred_1
# Apply threshold to classify predictions
predictions <- as.factor(as.numeric(prob_predictions > tuning_params$thresh))
# Calculate misclassification rate
predictions <- factor(predictions, levels = levels(valid_data$y)) # I had to force level matching because when fbs is used as a single predictor, it throws a level mismatch error, making stepwise selection a problem
error <- mean(predictions != valid_data$y)
# Store results
accuracy_df$misclass_rate[i] = error
# Put the accuracy results first and the formula last
accuracy_df <- accuracy_df %>%
select(misclass_rate, everything(), -formula, formula)
# Store predictions
preds_df <- data.frame(preds = predictions, prob_preds = prob_predictions) %>%
bind_cols(valid_data %>% select(y)) %>%
mutate(spec_no = i)
all_preds_df <- rbind(all_preds_df, preds_df)
}
# Return both results and predictions, including the list of model fits
return(list(
accuracy_df = accuracy_df,
preds = all_preds_df,
model_fits = model_fits # Add the list of model fits to the output
))
}
extract_predictors <- function(formula) {
# Remove "y ~" from the formula
formula_rhs <- gsub("y ~", "", formula)
# Trim spaces
formula_rhs <- trimws(formula_rhs)
# Split into vector if not empty, otherwise return empty vector
if (formula_rhs == "" || formula_rhs == "y ~") {
return(character(0)) # Return empty vector if no predictors in the formula
} else {
return(unlist(strsplit(formula_rhs, " \\+ "))) # Split by " + "
}
}
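For example:
# Split a formula string into its predictor names
extract_predictors("y ~ age + sex + cp") # returns c("age", "sex", "cp")
extract_predictors("y ~") # returns character(0)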
fwd_bckwd_step <- function(train_data, valid_data, tuning_params, model_fitting_procedure, engine, direction = "fwd") {
# Get all predictor names except the target variable "y"
predictors <- setdiff(names(train_data), "y")
# Split current_predictors string into individual variable names, if any
current_predictors <- extract_predictors(tuning_params$formula) # Fixed here, using extract_predictors function directly
# Find remaining predictors not in the current predictor set
predictors_left <- setdiff(predictors, current_predictors)
# Track the number of predictors currently in the model
n <- length(current_predictors) # Current number of predictors in the model
# Generate formulas based on direction
if (direction == "fwd") {
# Forward step: Generate new formulas by adding one predictor at a time
predictor_combos <- expand.grid(current = paste(current_predictors, collapse = " + "), new = predictors_left, stringsAsFactors = FALSE)
if (length(current_predictors) == 0) {
predictor_combos$formula <- paste("y ~", predictor_combos$new)
}
else {
predictor_combos$formula <- paste("y ~", predictor_combos$current, "+", predictor_combos$new)
}
} else if (direction == "bkwd") {
# Backward step: Generate new formulas by removing one predictor at a time
predictor_combos <- expand.grid(current = paste(current_predictors, collapse = " + "), remove = current_predictors, stringsAsFactors = FALSE)
predictor_combos$formula <- apply(predictor_combos, 1, function(row) {
remaining_vars <- setdiff(current_predictors, row["remove"])
paste("y ~", paste(remaining_vars, collapse = " + "))
})
}
# Expand tuning_params to match the number of new formulas
tuning_params <- tuning_params %>% select(-formula)
tuning_rows <- nrow(tuning_params)
predictor_rows <- nrow(predictor_combos)
fwd_tuning_grid <- tuning_params %>% slice(rep(1:tuning_rows, each = predictor_rows))
# Assign new formulas to the expanded tuning grid
fwd_tuning_grid$formula <- as.character(predictor_combos$formula)
# Calculate the accuracy for each specification in the grid
valid_output <- grid_validate_tidy(train_data, valid_data, fwd_tuning_grid, model_fitting_procedure, engine)
accuracy_df <- valid_output$accuracy_df
preds <- valid_output$preds
# Extract the best model (the one with the lowest misclassification rate)
best_index <- which.min(accuracy_df$misclass_rate)
best_model <- accuracy_df[best_index,]
best_model <- best_model[1,]
# Add a column that says that the predictions belong to the best model
preds <- preds %>% mutate(best = ifelse(spec_no == best_index, 1, 0))
# Get the final new formula with the new predictors from the best model
new_formula <- best_model$formula # Using extract_predictors to split the formula
# Output the best model and the new set of predictors
return(list(accuracy_df = accuracy_df, preds=preds, best_model = best_model, new_formula = new_formula, n = n))
}
stepwise_selection <- function(train_data, valid_data, tuning_params, model_fitting_procedure, engine, predictors_lim = 4, fwd_steps = 3, bkwd_steps = 2) {
# Get all predictor names except the target variable "y"
predictors <- setdiff(names(train_data), "y")
# Get the top number of predictors we can use
n_predictors <- predictors_lim
# Initialize counters for forward and backward steps
total_fwd = 0
total_bkwd = 0
n = 0 # start with 0 predictors in the model
all_preds_df <- data.frame() # initialise an empty dataframe to store the predictions from each spec
accuracy_df <- data.frame() # Initialise an empty dataframe for the accuracy results
# Perform major iterations (each major iteration consists of fwd_steps forward and bkwd_steps backward)
while (n < n_predictors) {
# Forward steps (for each major iteration, take fwd_steps forward)
for (i in 1:fwd_steps) {
if (n + 1 <= n_predictors) {
total_fwd = total_fwd + 1
n = n + 1
# Perform a forward selection step using fwd_bckwd_step
fwd_step_output <- fwd_bckwd_step(train_data, valid_data, tuning_params, model_fitting_procedure, engine, direction = "fwd")
# Store the best model for this forward step
best_model <- fwd_step_output$best_model
# Update selected_predictors to the new predictors from the best model
tuning_params$formula <- as.character(fwd_step_output$new_formula)
# Append a new row to accuracy_df with misclass_rate, total_fwd, etc., and tuning_params
new_row <- cbind(
total_steps = total_fwd + total_bkwd,
total_fwd = total_fwd,
total_bkwd = total_bkwd,
n = n,
misclass_rate = best_model$misclass_rate,
tuning_params
)
accuracy_df <- rbind(accuracy_df, new_row)
preds_df <- fwd_step_output$preds %>%
mutate(total_steps = total_fwd + total_bkwd)
all_preds_df <- rbind(all_preds_df, preds_df)
}
}
# Backward steps (after forward steps, perform bkwd_steps backward)
# Only start if can make at least the min number of backward steps
if ( n >= bkwd_steps+1) {
for (i in 1:bkwd_steps) {
if (n -1 >= 1) {
total_bkwd = total_bkwd + 1
n = n - 1
# Perform a backward selection step using fwd_bckwd_step
bkwd_step_output <- fwd_bckwd_step(train_data, valid_data, tuning_params, model_fitting_procedure, engine, direction = "bkwd")
# Store the best model for this backward step
best_model <- bkwd_step_output$best_model
# Update selected_predictors to the new predictors from the best model
tuning_params$formula <- as.character(bkwd_step_output$new_formula)
# Append a new row to accuracy_df with misclass_rate, total_bkwd, etc., and tuning_params
new_row <- cbind(
total_steps = total_fwd + total_bkwd,
total_fwd = total_fwd,
total_bkwd = total_bkwd,
n = n,
misclass_rate = best_model$misclass_rate,
tuning_params
)
accuracy_df <- rbind(accuracy_df, new_row)
preds_df <- bkwd_step_output$preds %>%
mutate(total_steps = total_fwd + total_bkwd)
all_preds_df <- rbind(all_preds_df, preds_df)
}
}
}
# If we can still take more forward steps and backward steps, continue another major iteration
if (n + fwd_steps > n_predictors) {
break # Stop if adding fwd_steps would exceed the total number of predictors
}
}
# Reorder columns to ensure misclass_rate and formula are first
accuracy_df <- accuracy_df[, c("misclass_rate", "formula", setdiff(names(accuracy_df), c("misclass_rate", "formula")))]
# Return the accuracy_df containing misclassification rates and tuning params for each step
return(list(accuracy_df = accuracy_df, preds = all_preds_df))
}
predictors <- names(Heart)[-1]
# Set up a set of formulae (subsets of predictors) that we can use
selected_predictors = c(paste(as.character(predictors[1]), collapse = " + "),
paste(as.character(predictors[1:4]), collapse = " + "),
paste(as.character(predictors[1:8]), collapse = " + "),
paste(as.character(predictors[1:12]), collapse = " + "))
formulas <- paste("y ~", selected_predictors)
This takes a specification grid output from
grid_validate_tidy() above as an input, plotting how
validation accuracy/error changes across up to 3 dimensions.
The first dimension is the x-axis, the second is the colour, and the
third is the shape; the latter two are included in the legend.
The first point to attain the minimum val_measure or error
measure (in this case, misclassification rate) is surrounded by a red
square, but this is not always the model we choose.
Often, our misclassification rate will be 0: we only have so many validation observations, and our predictions, like the true values, are binary, so each prediction either matches its true value exactly or not at all. If all predictions match the true values from the validation set, our prediction accuracy is maximised.
plot_grid <- function(grid, val_measure = "v_mse", tp1 = "n_preds_used", tp2 = NA, tp3 = NA, logx = FALSE) {
best_model <- grid[which.min(grid[[val_measure]]), ]
plot <- grid |>
ggplot(aes(x = .data[[tp1]])) +
geom_point(aes(
y = .data[[val_measure]],
color = if (!is.na(tp2)) as.factor(.data[[tp2]]) else NULL,
shape = if (!is.na(tp3)) as.factor(.data[[tp3]]) else NULL
), size = 2, alpha = 0.5) +
geom_point(data = best_model, aes(x = .data[[tp1]], y = .data[[val_measure]]),
color = "red", shape = 0, size = 4, stroke = 1.2) +
labs(
title = paste(val_measure, "vs.", tp1),
x = tp1,
y = val_measure
) +
expand_limits(y = 0.9 * min(grid[[val_measure]])) +
theme_minimal() +
theme(
legend.position = "bottom",
legend.box = "horizontal",
legend.margin = ggplot2::margin(t = 5, b = 5, unit = "pt"),
legend.key.height = unit(0.5, "cm"),
legend.spacing.x = unit(0.6, "cm")
) +
guides(
color = guide_legend(order = 1, ncol = if (!is.na(tp2) && !is.na(tp3)) 2 else 1),
shape = guide_legend(order = 2, ncol = if (!is.na(tp2) && !is.na(tp3)) 2 else 1)
)
if (!is.na(tp2)) {
plot <- plot + scale_color_discrete(name = tp2)
}
if (!is.na(tp3)) {
plot <- plot + scale_shape_discrete(name = tp3)
}
if (logx) {
plot <- plot + scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x))
}
print(plot)
}
In other words, this retrieves the corresponding sub-model,
regressors, functional forms and tuning parameters that give us the best
validation set accuracy from a grid (like that output from
grid_validate()) where each row is some different
specification.
get_best_spec <- function(grid_df, grid_validation) {
e_column <- grep("misclass|mse", colnames(grid_df), value = TRUE)
best_idx <- which.min(grid_df[[e_column]])
best_e <- grid_df[best_idx, e_column]
best_row <- grid_df[best_idx,]
best_preds <- grid_validation$preds[[best_idx]]
best_valids <- grid_validation$valids[[best_idx]]
best_fits <- grid_validation$fits[[best_idx]]
return(list(preds=best_preds, valids=best_valids, fits = best_fits, error = best_e, row =best_row))
}
This is set up to work both for continuous outcomes
(preds against valids scatter plot) and a
categorical outcome (confusion matrix).
graph_preds <- function(preds, valids, cm=T, scatter=F, classify=T) {
predictions_df <- data.frame(y = valids, yhat=preds)
if (cm) {
confusion_matrix <-
predictions_df |>
conf_mat(truth = y, estimate = yhat)
print(confusion_matrix |> autoplot(type = "heatmap"))
}
if (scatter == T) {
if (classify) {
predictions_df$yhat <- as.numeric(predictions_df$yhat) - 1
predictions_df$y <- as.numeric(predictions_df$y) - 1
}
plot(predictions_df$yhat, predictions_df$y)
abline(0, 1)
}
error_col <- paste0(ifelse(classify, "misclass_rate", "mse"))
print(paste(error_col, ":", calculate_error(preds, valids, classify)))
}
We implemented a simple gradient descent algorithm to classify whether patients have coronary artery disease (CAD) given \(p \in \mathbb{N}\) medical characteristics \(\boldsymbol{\theta} = (\theta_{1}, \dots, \theta_{p})^\mathsf{T}\).
Consider a regularized logistic regression model with parameters \(\boldsymbol{\beta}\), a \((p+1) \times 1\) column vector containing the coefficients on our medical characteristics of interest (\(\boldsymbol{\theta}\)) plus an added intercept coefficient. We are given \(N \in \mathbb{N}\) observations. Let \(\boldsymbol{X}\) be the \(N \times (p+1)\) design matrix for the input space, where each \(\boldsymbol{x_{i}}\) is a row vector of dimension \(1 \times (p+1)\) denoting the \(i\)-th row of observations. Finally, let \(\boldsymbol{y} = (y_{1}, \dots, y_{N})^\mathsf{T}\) be the \(N \times 1\) column vector of target classes, representing whether each observation has CAD or not.
We first need to select our medical features of interest. Because we want this model to be relatively explainable, and figuring out how to include interaction terms is beyond the scope of this project, we want a smaller input space. We use best subset selection via regsubsets (its default, exhaustive search), as shown below.
subset_selection <- regsubsets(y ~ ., data = Heart_split %>% training())
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 1 linear dependencies found
summary(subset_selection)
## Warning in log(vr): NaNs produced
## Subset selection object
## Call: regsubsets.formula(y ~ ., data = Heart_split %>% training())
## 15 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## sex FALSE FALSE
## cp FALSE FALSE
## trestbps FALSE FALSE
## chol FALSE FALSE
## fbs FALSE FALSE
## restecg FALSE FALSE
## thalach FALSE FALSE
## exang FALSE FALSE
## oldpeak FALSE FALSE
## slope FALSE FALSE
## ca FALSE FALSE
## thal FALSE FALSE
## target FALSE FALSE
## sex_factorMale FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " " " " "*" " " " " "*"
## 5 ( 1 ) "*" " " "*" " " " " " " " " " " " " "*" " " " "
## 6 ( 1 ) "*" "*" "*" " " " " " " " " "*" " " "*" " " " "
## 7 ( 1 ) "*" " " "*" " " "*" "*" " " " " " " "*" " " " "
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " " " " " " "
## thal target sex_factorMale
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" "*"
## 8 ( 1 ) " " "*" " "
predictor_names <- c("exang", "oldpeak", "ca")
We see that the best 3-predictor model includes exang,
oldpeak, and ca, which are exercise-induced
angina, ST depression induced by exercise relative to rest, and the
number of major vessels colored by fluoroscopy, respectively. We analyze
the correlations between these predictors and see some moderate positive
correlation, as shown in the correlation heatmap below. This
correlation suggests some multicollinearity.
correlation_matrix <- cor(Heart %>% select(all_of(predictor_names)))
corrplot(corr = correlation_matrix,method = "color")
We first define our predictors of interest in a vector which is used
throughout our implementation. We then filter the dataset for the
requested predictors and the binary response target.
This model still uses the same training/testing subsets as the other
models but we filter out unnecessary predictors for our implementation
of gradient descent; these are then renamed to training and
testing for this section only.
training <- Heart_train %>% select(all_of(predictor_names), "target")
testing <- Heart_valid %>% select(all_of(predictor_names), "target")
We seek to find the optimal coefficients \(\boldsymbol{\beta} = (\beta_{0}, \dots, \beta_{p})^\mathsf{T}\) that minimize the negative log-likelihood with an added positive penalty term \(\lambda \in \mathbb{R}\). We decided to use L2 (ridge) regularization for the following reasons:
We are building a simpler model with fewer predictors, so we ideally do not want any of them shrunk all the way to zero, as could happen under L1 (lasso) regularization;
The presence of slight multicollinearity suggests that ridge regression will outperform lasso regression, since it distributes the penalty across the correlated predictors; and
L2 regularization is differentiable everywhere since it is a sum of squares, whereas L1 uses a sum of absolute values, which complicates our analytical solution.
The negative log-likelihood, also known as the logistic loss, is given by the following equation
\[ \ell(\beta \; ; \; y \, \vert \, x) = -\sum_{i=1}^{N} \left [ \, y_{i} \log(\sigma(x_{i})) + (1 - y_{i}) \log(1 - \sigma(x_{i})) \, \right] + \lambda \sum_{j=1}^{p} \beta_{j}^{2}, \]
where \(\sigma(\bullet \; ; \; \beta)\) is given by the logistic function (commonly known as the sigmoid function)
\[ \sigma(x_{i}) = \frac{1}{1 + e^{-x_{i}\beta}} . \]
We first need to find the gradient of the loss function by computing derivatives
\[ \frac{\partial}{\partial \beta} \; \ell(\beta \; ; \; y \, \vert \, x) = -\sum_{i=1}^{N} \left [ \, y_{i} \frac{1}{\sigma(x_{i})} \frac{\partial \sigma(x_{i})}{\partial \beta} - (1 - y_{i}) \frac{1}{1 - \sigma(x_{i})} \frac{\partial \sigma(x_{i})}{\partial \beta} \, \right] + 2\lambda\beta. \]
We see that
\[ \frac{\partial}{\partial \beta} \; \sigma(x_{i} \; ; \; \beta) = \frac{\partial}{\partial \beta} \left ( \frac{1}{1 + e^{-x_{i}\beta}} \right) = \frac{x_{i} e^{-x_{i}\beta}}{(1 + e^{-x_{i}\beta})^{2}} = \sigma(x_{i})(1-\sigma(x_{i}))x_{i}, \]
and so substituting back into the gradient function simplifies the equation to
\[ \begin{align} \frac{\partial}{\partial \beta} \; \ell(\beta \; ; \; y \, \vert \, x) &= -\sum_{i=1}^{N} \left [ \, y_{i}(1 - \sigma(x_{i}))x_{i} - (1-y_{i})\sigma(x_{i})x_{i} \, \right] + 2\lambda\beta\\ &=\sum_{i=1}^{N} \left [ \, (\sigma(x_i) - y_{i}) x_{i} \, \right ] + 2\lambda\beta. \end{align} \]
(Notice we distributed the negative in front of the summation through and factored out the \(x_{i}\) found in both terms). We define these functions in R for later in the implementation.
We separate out different components of the gradient descent
algorithm in smaller, easier to understand functions. This helps to make
the code more readable and easier to debug. We first define the methods
logistic_func, logistic_loss, and
gradients based on the mathematical derivations found
above.
# s-shaped sigmoid function
logistic_func <- function(logits) {
1/(1 + exp(-logits))
}
# implemented using L2 ridge regularization (note: the intercept term does not get penalized)
logistic_loss <- function(betas, X, y, lambda) {
# note that `y_hats` are the predicted conditional probabilities having applied the sigmoid function on the design matrix and coefficients product NOT the predicted
# classes of heart disease presence
y_hats <- logistic_func(X %*% betas)
penalty <- lambda * sum(betas[-1] * betas[-1])
-sum(y * log(y_hats) + (1 - y) * log(1 - y_hats)) + penalty
}
# this function is not a numeric approximation of the gradient but rather the direct analytical
# functional form as derived in the mathematics above
gradients <- function(betas, X, y, lambda) {
# note that `y_hats` are the predicted conditional probabilities having applied the sigmoid function on the design matrix and coefficients product NOT the predicted
# classes of heart disease presence
y_hats <- logistic_func(X %*% betas)
# separate the logistic and penalty terms to compute gradients easier
logistic_gradient <- t(X) %*% (y_hats - y)
penalty_gradient <- 2 * lambda * betas[-1]
# sum the two components of the gradient while keeping the intercept term non-regularized
logistic_gradient + c(0, penalty_gradient)
}
The baseline helper functions are now defined for later use.
Our implementation requires that the input space be a matrix \(\boldsymbol{X}\) of a specific format, which
design_matrix forces the data to conform to. The
update_step_size function uses the Barzilai-Borwein (BB) method to
update how large a step the descent takes at each iteration, based on
the two previous coefficient estimates and gradients. We settled on
using BB to help the model converge more dynamically without setting an
arbitrary learning rate. Finally, the
gradient_descent function uses all of the previously
defined methods to run logistic gradient descent given the tuning
parameters (the penalty lambda, the convergence tolerance, and the
maximum number of iterations) and returns histories of the
coefficients, gradients, losses, step sizes, and the predictor
names.
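For reference, the (long) Barzilai-Borwein step size implemented in update_step_size below is
\[ \alpha_{k} = \frac{\Delta\boldsymbol{\beta}^\mathsf{T} \Delta\boldsymbol{\beta}}{\Delta\boldsymbol{\beta}^\mathsf{T} \Delta\boldsymbol{g}}, \qquad \Delta\boldsymbol{\beta} = \boldsymbol{\beta}_{k-1} - \boldsymbol{\beta}_{k-2}, \quad \Delta\boldsymbol{g} = \nabla\ell(\boldsymbol{\beta}_{k-1}) - \nabla\ell(\boldsymbol{\beta}_{k-2}). \]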
# format the training dataset by removing the target column, adding an intercept
# column of all 1s, and scaling the inputs before returning it as a matrix
# (necessary for the %*% notation used)
design_matrix <- function(training) {
training %>%
select(-c(all_of("target"))) %>%
mutate(intercept = 1, .before = 1) %>%
mutate_at(vars(-intercept), scale) %>%
as.matrix
}
# returns the optimal step size as computed by the Barzilai-Borwein method given
# the current and previous beta and gradient
update_step_size <- function(iteration, betas, gradients) {
betas_delta <- betas[iteration - 1,] - betas[iteration - 2,]
gradients_delta <- gradients[iteration - 1,] - gradients[iteration - 2,]
# we have found that taking long BB step sizes works better for these data
sum(betas_delta %*% betas_delta) / sum(betas_delta %*% gradients_delta)
}
gradient_descent <- function(training, predictor_names, lambda, tolerance, max_iterations) {
# format the X and y training values to be of use
X_training <- design_matrix(training)
y_training <- training$target
# randomly sample `p` uniformly-distributed values for the initial coefficients
# since the inputs are scaled in the range [0,1]
initial_betas <- runif(ncol(X_training))
initial_step_size <- runif(1, min = 0.001, max = 0.01)
# each value is an iteration of updated losses to keep track of its history;
# the same goes for the step sizes
losses <- numeric(max_iterations)
step_sizes <- numeric(max_iterations)
# each row is an iteration of updated coefficients to keep track of its history;
# the same goes for the gradients
betas <- matrix(NA, nrow = max_iterations, ncol = ncol(X_training))
gradients <- matrix(NA, nrow = max_iterations, ncol = ncol(X_training))
# save the initial coefficients as the randomly selected ones and manually
# kickstart the descent algorithm
betas[1,] <- initial_betas
gradients[1,] <- gradients(betas[1,], X_training, y_training, lambda)
step_sizes[1] <- initial_step_size
losses[1] <- logistic_loss(betas[1,], X_training, y_training, lambda)
# another iteration of manually computing the descent algorithm to ensure
# that it is working before looping through
betas[2,] <- betas[1,] - step_sizes[1] * gradients[1,]
gradients[2,] <- gradients(betas[2,], X_training, y_training, lambda)
step_sizes[2] <- initial_step_size
losses[2] <- logistic_loss(betas[2,], X_training, y_training, lambda)
for(iteration in 3:max_iterations) {
previous_beta <- betas[iteration - 1,]
previous_gradient <- gradients[iteration - 1,]
previous_step_size <- step_sizes[iteration - 1]
# compute the optimal step size using the Barzilai-Borwein method and then
# update the coefficients, gradients, and losses
step_sizes[iteration] <- update_step_size(iteration, betas, gradients)
betas[iteration,] <- previous_beta - step_sizes[iteration] * previous_gradient
gradients[iteration,] <- gradients(betas[iteration,], X_training, y_training, lambda)
losses[iteration] <- logistic_loss(betas[iteration,], X_training, y_training, lambda)
# we compute the norm, or size of, the gradient vector using the L2 norm
# (Euclidean norm) and check if it is reasonably small enough
if (norm(gradients[iteration,], type = "2") < tolerance) {
# trim the betas, gradients, losses, and step sizes matrices/vectors
# as to not have a bunch of trailing zeros or NULL values and then exit
# the loop
betas <- betas[1:iteration,]
gradients <- gradients[1:iteration,]
losses <- losses[1:iteration]
step_sizes <- step_sizes[1:iteration]
break
}
}
# return all of the matrices and vectors used in a R list which will be
# convenient by using the `$.` notation on the model output
list(
coefficients = betas,
gradients = gradients,
losses = losses,
step_sizes = step_sizes,
predictors = c("(Intercept)", predictor_names)
)
}
Note that we scale every column in the input space (besides the
intercept and target columns) in the design_matrix
function. When we ran this gradient descent algorithm on unscaled values
the logistic loss broke down after the first few iterations. A deeper
examination revealed that the scale of the input values could range from
126–564 mg/dL (in the case of cholesterol) whereas our response is only
binary. This size difference meant that \(\boldsymbol{X\beta}\) would either produce
extremely large or extremely small values which would become numerically
unstable once the sigmoid function was applied \(\sigma(\boldsymbol{X\beta})\).
Consider if \(\sigma(\boldsymbol{X\beta}) =
0\). Then, \(\log(\sigma(\boldsymbol{X\beta})) =
\log(0)\) which is undefined. Similarly, if \(\sigma(\boldsymbol{X\beta}) = 1\) then,
\(\log(1-\sigma(\boldsymbol{X\beta})) =
\log(0)\) which is also undefined. The output of sigmoid being
exactly 0 or exactly 1 happens when the result of \(\boldsymbol{X\beta}\) is numerically
unstable. The logistic_loss method would then record the
loss as NaN (not a number) in the losses history at
every subsequent iteration. The coefficients and
gradients histories contained very large values, meaning the descent
jumped around and never converged on the local minimum, only
stopping because max_iterations was reached.
We scaled down the input space \(\boldsymbol{X}\) (besides the intercept and target columns) to combat the numerical instability and our implementation works as expected.
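As a small illustration of the problem, with hypothetical extreme values of the linear predictor:
# Hypothetical extreme values of X %*% beta saturate the sigmoid at exactly 0 or 1
logistic_func(c(-800, 800)) # returns 0 and 1 due to floating point under/overflow
log(logistic_func(-800)) # log(0) = -Inf, making the loss non-finite (NaN once multiplied by y = 0)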
Before we run our logistic regression gradient descent implementation on our project data, we want to ensure that it works on simulated data. The following method does this, and we show a preview of the generated data as well as the true coefficients.
simulate_data <- function(n_observations = 5000, p_predictors = 10) {
set.seed(1)
sparsity <- p_predictors / 2
nonzero_betas <- runif(sparsity, min = -1, max = 1)
true_coefficients <- c(.5, nonzero_betas, rep(0, sparsity))
# we again generate random values from a uniform distribution to match the scale
# of the inputs of the target dataset
raw_data <- matrix(
runif(n_observations * p_predictors),
ncol = p_predictors
)
X_simulated <- cbind(
rep(1, n_observations),
raw_data
)
y_simulated <- rbinom(
n = n_observations,
size = 1,
prob = logistic_func(X_simulated %*% true_coefficients)
)
# create the simulated dataset in the correct format as expected by the gradient
# descent method; additionally create dummy predictor names using R's default V{1,...,p}
# undefined column name notation
simulated_dataset <- cbind(X_simulated[,-1], target = y_simulated) %>% as.data.frame
predictor_names <- paste0("V", 1:p_predictors)
set.seed(NULL)
list(
data = simulated_dataset,
true_coefficients = true_coefficients,
predictors = predictor_names
)
}
p_predictors <- 4
simulated_training <- simulate_data(n_observations = 2000, p_predictors = p_predictors)
simulated_training$data %>% head
simulated_training$true_coefficients %>% print
## [1] 0.5000000 -0.4689827 -0.2557522 0.0000000 0.0000000
We also define the simulated testing dataset, which will be used later to evaluate the naive model. It is generated in the same manner as its training counterpart.
simulated_testing <- simulate_data(n_observations = 1000, p_predictors = p_predictors)
simulated_testing$data %>% head
simulated_testing$true_coefficients %>% print
## [1] 0.5000000 -0.4689827 -0.2557522 0.0000000 0.0000000
We now define our naive model by calling our
gradient_descent implementation on the
simulated_training object, so that we can later visualize
how the fit behaves during training and, eventually, how it
performs on unseen testing data.
naive_model <- gradient_descent(
training = simulated_training$data,
predictor_names = simulated_training$predictors,
lambda = 1.7,
tolerance = 1e-5,
max_iterations = 10000
)
We additionally define some methods for understanding the output of our model including a table of coefficients, a plot of losses over iterations, a plot of step sizes over iterations, and a plot of coefficients over iterations.
We can now compare the coefficients the naive model converged on
against the true coefficients by using the
compare_coefficients method. This function generates a
table of true and estimated coefficients, with a third column giving
their difference so that the size of the discrepancy is easy to see.
compare_coefficients <- function(modelA_coefficients, modelB_coefficients, predictors, column_names) {
# compute the difference in coefficients to more easily see how off our implementation estimates are
# either in the case of our model versus R's `glm` version or our naive model versus the true
# coefficients
difference <- modelA_coefficients - modelB_coefficients
# assemble the table and make it more readable with meaningful column and row names
coefficients_comparison <- cbind(modelA_coefficients, modelB_coefficients, difference)
colnames(coefficients_comparison) <- c(column_names, "difference")
rownames(coefficients_comparison) <- predictors
coefficients_comparison
}
naive_coefficients <- naive_model$coefficients[nrow(naive_model$coefficients),]
compare_coefficients(
modelA_coefficients = naive_coefficients,
modelB_coefficients = simulated_training$true_coefficients,
predictors = c("(Intercept)", simulated_training$predictors),
column_names = c("estimated coef.", "true coef.")
)
## estimated coef. true coef. difference
## (Intercept) 0.11342276 0.5000000 -0.38657724
## V1 -0.15215072 -0.4689827 0.31683195
## V2 -0.14944351 -0.2557522 0.10630869
## V3 0.03573019 0.0000000 0.03573019
## V4 -0.04513723 0.0000000 -0.04513723
We can also observe the logistic loss history saved in the
naive_model object and view it on a plot via the
plot_losses function. Note that this method accepts a
model_type parameter which is placed into the plot’s title
for clarity so that this function can be used by both the naive and
(later) the final models.
plot_losses <- function(model, model_type = "normal") {
n_iterations <- length(model$losses)
min_loss <- min(model$losses)
ggplot(data.frame(iteration = 1:n_iterations, loss = model$losses)) +
# connect each loss value per iteration with a solid blue line
geom_line(aes(x = iteration, y = loss), color = "blue", linewidth = 1) +
# visibly show where the minimum loss achieved is and add text with the number
geom_hline(yintercept = min_loss, linetype = "dotted", color = "gray", linewidth = 0.75) +
annotate("text", x = n_iterations * 0.9, y = min_loss, label = paste("minimum loss:", round(min_loss, 2)), vjust = -0.5, color = "darkgray") +
scale_x_continuous(
limits = c(1, n_iterations),
breaks = c(1:n_iterations)
) +
# we use the parameter `model_type` since we use this same method for both our actual model
# and the naive model
labs(
title = paste("Losses of binary cross-entropy of", model_type, "model over iterations"),
subtitle = paste("Fitted on the predictor(s)", toString(model$predictors)),
caption = "Data from the 1988 UCI Heart Disease study"
) +
theme_classic() +
theme(legend.position = "none") +
xlab("Iteration") +
ylab("Loss")
}
plot_losses(model = naive_model, model_type = "naive")
We can additionally observe the step sizes history saved in the
naive_model object and view it on a plot via the
plot_steps function. It is nearly identical to
plot_losses but with different labels and verbiage.
plot_steps <- function(model, model_type = "normal") {
n_iterations <- length(model$step_sizes)
ggplot(data.frame(iteration = 1:n_iterations, step = model$step_sizes)) +
# connect each step size value per iteration with a solid blue line
geom_line(aes(x = iteration, y = step), color = "blue", linewidth = 1) +
scale_x_continuous(
limits = c(1, n_iterations),
breaks = c(1:n_iterations)
) +
labs(
title = paste("Barzilai-Borwein step sizes of", model_type, "model over iterations"),
subtitle = paste("Fitted on the predictor(s)", toString(model$predictors)),
caption = "Data from the 1988 UCI Heart Disease study"
) +
theme_classic() +
theme(legend.position = "none") +
xlab("Iteration") +
ylab("Step size")
}
plot_steps(model = naive_model, model_type = "naive")
We lastly observe the coefficients history saved in the
naive_model object and view it on a plot via the
plot_betas function. This allows readers to see, first,
a snapshot of all the coefficients across the iterations and, second,
how each coefficient changes throughout the descent.
plot_betas <- function(model, model_type = "normal") {
n_iterations <- nrow(model$coefficients)
to_plot <- data.frame(
iteration = c(1:n_iterations),
coefficient = as.data.frame(model$coefficients)
)
colnames(to_plot) <- c("iteration", model$predictors)
# we re-shape the data so that we can connect each coefficient across iterations AND
# show all coefficients at each snapshot via pivoting the dataframe longwise
to_plot <- to_plot %>%
pivot_longer(cols = -iteration, names_to = "predictor", values_to = "coefficient")
ggplot(to_plot, aes(x = iteration, y = coefficient, color = predictor)) +
geom_point(size = 3, stroke = 1.2) +
geom_line() +
scale_x_continuous(
limits = c(1, n_iterations),
breaks = c(1:n_iterations)
) +
labs(x = "Iteration", y = "Coefficient") +
labs(
title = paste("Esimated coefficients of the", model_type, "model over iterations"),
subtitle = paste("Fitted on the predictor(s)", toString(model$predictors)),
caption = "Data from the 1988 UCI Heart Disease study"
) +
theme_minimal()
}
plot_betas(model = naive_model, model_type = "naive")
Having analyzed the naive model coefficients and visualized the
losses and step sizes, we now seek to evaluate it by applying the
testing subset and seeing how accurate its predictions are.
We do this with the prediction_metrics method.
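The prediction_metrics function itself is defined elsewhere in the full document; the following is only a minimal sketch of one possible implementation, assuming it takes the final coefficient vector together with the training and testing data frames and returns the accuracy/AUC fields and testing classes used in the comparisons below.
# Minimal sketch (assumed interface): compute accuracy and AUC on the training and
# testing subsets from a vector of final coefficients, reusing design_matrix and
# logistic_func defined earlier
library(pROC) # for roc() and auc()
prediction_metrics <- function(betas, training, testing, threshold = 0.5) {
evaluate <- function(data) {
X <- design_matrix(data) # intercept column added and inputs scaled
probabilities <- as.vector(logistic_func(X %*% betas)) # predicted P(target = 1)
classes <- as.numeric(probabilities > threshold)
list(
accuracy = mean(classes == data$target),
auc = as.numeric(auc(roc(data$target, probabilities, quiet = TRUE))),
classes = classes
)
}
training_results <- evaluate(training)
testing_results <- evaluate(testing)
list(
training_accuracy = training_results$accuracy,
training_auc = training_results$auc,
testing_accuracy = testing_results$accuracy,
testing_auc = testing_results$auc,
true_classes_testing = testing$target,
prediction_classes_testing = testing_results$classes
)
}
Under these assumptions, the naive_metrics, log_metrics, and r_metrics objects referenced below are lists with exactly these fields.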
We will analyze the metrics of the naive model once we do the same
for the normal model and R's glm model. We now create
these model objects and examine the normal model's
output.
log_model <- gradient_descent(
training = training,
predictor_names = predictor_names,
lambda = 0.0,
tolerance = 1e-5,
max_iterations = 10000
)
r_training <- training %>% mutate_at(vars(-target), scale)
r_model <- glm(target ~ ., family = binomial, data = r_training)
Because we do not know the “true coefficients” of our actual heart
disease dataset, we instead compare our model's coefficients against R's
glm model's coefficients to see how similar the values are.
As seen in the table below, the differences between our model's and R's
coefficients are on the order of \(10^{-8}\).
log_coefficients <- log_model$coefficients[nrow(log_model$coefficients),]
compare_coefficients(
modelA_coefficients = log_coefficients,
modelB_coefficients = r_model$coefficients,
predictors = log_model$predictors,
column_names = c("model coef.", "R glm coef.")
)
## model coef. R glm coef. difference
## (Intercept) -0.1112793 -0.1112794 1.240711e-08
## exang -0.8993333 -0.8993333 2.024936e-08
## oldpeak -0.9914144 -0.9914143 -1.104230e-08
## ca -0.8298566 -0.8298566 3.552889e-08
To see the significance of the selected predictors in explaining our
response, we examine the model output of glm. We see from
the table below that exang, oldpeak, and
ca are all statistically significant quantities.
summary(r_model)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = r_training)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.11128 0.09994 -1.113 0.266
## exang -0.89933 0.10194 -8.822 < 2e-16 ***
## oldpeak -0.99141 0.12428 -7.977 1.5e-15 ***
## ca -0.82986 0.10533 -7.879 3.3e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.74 on 716 degrees of freedom
## Residual deviance: 656.09 on 713 degrees of freedom
## AIC: 664.09
##
## Number of Fisher Scoring iterations: 5
Examining the \(p\)-values for these
coefficients, we see that exang, oldpeak, and
ca are all statistically significant predictors (all \(< 0.05 = \alpha\)) in explaining the
presence of coronary artery disease.
plot_losses(model = log_model, model_type = "normal")
plot_steps(model = log_model, model_type = "normal")
plot_betas(model = log_model, model_type = "normal")
We see that the logistic loss decreases sharply and lands on a minimum loss of 358.72 at around 15 iterations. The step size history is more random but regardless the movement of the gradient vector is extremely small as shown by the scale of the \(y\)-axis. Finally, the estimated coefficients over iterations plot shows that our model starts to level off after around the 5th iteration.
Another way to compare our model's coefficients with those of R's
glm model is to overlay them by predictor,
as shown in the plot below. We use a custom method called
coefficients_comparison.
coefficients_comparison <- function(model, r_model) {
final_coefficients <- model$coefficients[nrow(model$coefficients),]
r_coefficients <- r_model$coefficients
predictors <- model$predictors
to_plot <- data.frame(
predictor = factor(predictors, levels=predictors),
normal_coefficients = final_coefficients,
glm_coefficients = r_coefficients
) %>% pivot_longer(
cols = -predictor,
names_to = "model",
values_to = "coefficient"
)
ggplot(to_plot, aes(x = predictor, y = coefficient, shape = model, color = model)) +
geom_point(size = 3, stroke = 1.2) +
labs(x = "Predictor", y = "Coefficient") +
labs(title = "Comparing estimated coefficients between our normal and glm models",
subtitle = paste("Fitted on the predictor(s)", toString(predictors)),
caption = "Data from the 1988 UCI Heart Disease study") +
theme_minimal()
}
coefficients_comparison(model = log_model, r_model = r_model)
From the above plot we see that our normal model and R's
glm model have essentially the same coefficients, which
matches the comparison table seen earlier. On average, our model's
coefficients are very marginally higher than R's glm coefficients,
but the differences are negligible given the scale of the \(y\)-axis.
We can now analyze the prediction metrics of our model and R’s
glm model to evaluate their effectiveness as well as
compare to our naive model.
We now look at the accuracy and AUC metrics across all three models as summarized in the following table.
all_metrics <- data.frame(
"training accuracy" = c(
naive_metrics$training_accuracy,
log_metrics$training_accuracy,
r_metrics$training_accuracy
),
"training auc" = c(
naive_metrics$training_auc,
log_metrics$training_auc,
r_metrics$training_auc
),
"testing accuracy" = c(
naive_metrics$testing_accuracy,
log_metrics$testing_accuracy,
r_metrics$testing_accuracy
),
"testing auc" = c(
naive_metrics$testing_auc,
log_metrics$testing_auc,
r_metrics$testing_auc
)
)
rownames(all_metrics) <- c("naive model", "normal model", "glm model")
colnames(all_metrics) <- c("training accuracy", "training auc", "testing accuracy", "testing auc")
all_metrics
We also want to view the Receiver Operating Characteristic (ROC)
curve, available via the pROC library through our wrapper
function plot_ROC, to compare the true and false positive
rates. Note that this is plotted on the testing data as
another visualization of model accuracy. As expected, the naive model's
AUC is only slightly over 50%, which indicates that our gradient descent
implementation picks up some slight noise in the data but is overall not
much better than flipping a coin to determine heart disease presence.
In contrast, our model and R's glm model, fitted on
exang, oldpeak, and ca, both have an AUC
of around 81%, which suggests that they perform strongly in distinguishing
between CAD presence and absence.
plot_ROC <- function(metrics, curve_color, y_height, add_to = FALSE) {
# `y_height` is the height where the AUC percentage label is printed so that they do not
# overlap; `add_to` is needed to overlap the naive, normal, and glm model ROC curves
roc_object <- roc(
response = metrics$true_classes_testing,
predictor = metrics$prediction_classes_testing,
plot = TRUE,
percent = TRUE,
xlab = "False positive percentage",
ylab = "True positive percentage",
col = curve_color,
lwd = 3,
print.auc = TRUE,
print.auc.y = y_height,
add = add_to,
quiet = TRUE
)
}
plot_ROC(metrics = naive_metrics, curve_color = "red", y_height = 45, add_to = FALSE)
plot_ROC(metrics = log_metrics, curve_color = "blue", y_height = 35, add_to = TRUE)
plot_ROC(metrics = r_metrics, curve_color = "lightgreen", y_height = 25, add_to = TRUE)
models <- c("naive model", "normal model", "glm model")
colors <- c("red", "blue", "lightgreen")
legend("bottomright", legend = models, col = colors, lwd = 3)
The above ROC plot is useful to see visually how the models compare but we would like to see hard values for the true and false positives and negatives across the testing subset for our model. This is done through the confusion matrix shown below.
testing_df <- data.frame(
y = as.factor(testing$target),
y_hat = as.factor(log_metrics$prediction_classes_testing)
)
confusion_matrix <- testing_df %>%
conf_mat(truth = y, estimate = y_hat) %>%
autoplot(type = "heatmap") %>%
print
As we see from the confusion matrix above, our model correctly classifies heart disease prevalence the majority of the time, with the few false negatives and false positives relatively balanced. This suggests that our model performs moderately well on the holdout data, which is supported by an AUC score of around 79.4%.
We choose a short classification tree as a relatively interpretable model, given that it can be read and understood in an intuitive way without necessarily having specific domain knowledge.
This is because, instead of using a complicated functional form for
how the outcome depends on its predictors, a tree fit to data creates,
at each level of depth, a single split into different
groupings, or “nodes.” For example, all those patients with
cp (chest pain type, an integer code with no units)
above 0.5 may be assigned to one grouping and
those below to another group. This split
means each grouping will have different characteristics (which are used
to predict probabilities). In turn, at each subsequent level of depth,
these groupings are split further and further based on which side of a
threshold for a chosen predictor an observation falls.
Each of these splits is created so as to maximise node purity. The predicted probability of the positive class, conditional on being in a given node (that is, having predictor values that put a patient in that grouping), is the proportion of training observations in that node belonging to the positive class.
It then seeks to achieve maximum node purity by minimising the Gini Index, which takes its smallest values (closest to 0) if either all the predicted probabilities are close to 1, or all of them are close to 0 (James, Witten, Hastie, & Tibshirani, 2023). Intuitively, it can be seen that when a grouping is more “pure” (i.e., the observations tend to have more similar outcomes), then it makes more sense for a tree to define a split that creates such a grouping, and consequently associate the characteristics that put a patient into that group as characteristics associated with the probability of the positive class in that group.
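Concretely, for a node \(m\) with within-node class proportions \(\hat{p}_{mk}\), the Gini index is
\[ G_{m} = \sum_{k} \hat{p}_{mk} (1 - \hat{p}_{mk}), \]
which approaches 0 as the node becomes dominated by a single class.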
This ultimately creates discrete, mutually exclusive and collectively exhaustive final groupings/nodes, each of which have certain characteristics.
This recursive splitting process maps directly onto a flow-diagram process for iteratively distinguishing and grouping patients. Therefore, it can be crucially informative for diagnosis.
A medical professional can proceed by gathering key data on a patient and iteratively going through a checklist.
At the first bullet or “depth” level of the checklist, a piece of information on the patient is used to put them into a sub-group, depending on whether they exceed or fall short of a given threshold on that data point.
At each subsequent “depth” level in the checklist, another piece of information on the patient (perhaps one already used before) is used to put them into a sub-group of the group they were just in. Eventually, the patient is placed in one of the final groupings.
Based on this grouping, and the characteristics associated with being in that grouping, the medical professional can predict the chance (conditional on those characteristics) that the patient will have heart disease. If this probability is sufficiently high, a positive diagnosis can be given.
In the case of heart disease diagnosis, we might not simply want the highest accuracy, but perhaps particularly avoid false negatives (and maximise true positives, even at the cost of false positives). This depends, of course, on the treatments prescribed to the patient.
Treatments often fall into one of the below groupings (sourced from British Heart Foundation, 2025; NHS, 2025; Mayo Clinic, 2025):
On the one hand, false positives may cause undue stress, when someone is misleadingly worried about heart disease that they may not have.
If the prescribed treatment is benign, as in the case of lifestyle changes, it makes sense to be less cautious about coming to a positive diagnosis, because taking on this prescription will at best bring large personal benefits and at worst incur marginal costs relative to those benefits. Eating less sugar and saturated fat leads to a slight quality-of-life reduction in the short term, but a healthier lifestyle has substantial benefits and often improves quality of life in the long term, derived from profound rather than superficial pleasures, even if a patient does not yet have, nor is at risk of, heart disease.
This also applies to milder medicine prescriptions, whose side-effects range from negligible symptoms to some only as severe as dizziness, headaches and fatigue (ibid).
However, to the extent that treatments are more severe in nature, such as stronger medicines or surgical procedures, we may wish to use a higher threshold to avoid the consequences of prescribing treatment to someone who was not at risk. Moving forward, because the variables in our data have not previously been used to prescribe surgical treatment, and because the binary nature of our outcome variable gives no indication of the severity of heart disease (which would be essential in that application), we will only consider diagnostic applications that result in lifestyle changes and medicine prescriptions.
On the other hand, it is likely worth trading off more false positives in exchange for more true positives, so that a greater amount of corrective treatment can be prescribed and we can prevent the conditions of more patients from getting worse (even if they do not yet truly have coronary heart disease).
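The classification threshold is what operationalises this trade-off. A minimal sketch with made-up probabilities (not from our fitted models) shows how lowering the cut-off converts false negatives into true positives at the price of extra false positives.
# Hypothetical predicted probabilities of disease and hypothetical true classes
p_hat_example <- c(0.85, 0.62, 0.41, 0.35, 0.20, 0.10)
truth_example <- c(1,    1,    1,    0,    1,    0)
table(pred = as.integer(p_hat_example > 0.5), truth_example) # 0.5 cut-off: 2 false negatives, 0 false positives
table(pred = as.integer(p_hat_example > 0.3), truth_example) # 0.3 cut-off: 1 false negative, 1 false positive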
Likewise, any non-professional (perhaps a patient themselves, or a trainee) could also use such a flow chart structure to identify the chance of a positive diagnosis and which factors tend to be most associated with a high risk of heart disease, making diagnosis on a larger scale more practical.
This first tree flow chart can be used either for informative purposes as to what may tentatively be leading factors behind heart disease, or diagnosis situations in which all the measures in the training data are accessible to the professional giving the diagnosis.
In this case, we use the rpart engine as opposed to the
tree engine.
We begin by setting up a tuning grid, then view the resulting validation accuracies and a plot over the grid.
tuning_grid_tree <- expand.grid(
cp = c(0.001), # Cost complexity pruning (alpha parameter that imposes penalty on tree depth)
maxdepth = c(1,3,4,5,6,8,10,15,20), # Limit tree depth
min_n = c(5), # Number of obs per split, so if higher, reduces splits
thresh = c(0.3,0.4, 0.5),
formula = as.character("y ~.")
)
tuning_grid_tree$formula <- as.character(tuning_grid_tree$formula)
tree_output <- grid_validate_tidy(Heart_train, Heart_valid, tuning_grid_tree, "tree", "rpart")
# View the accuracy results
tree_output$accuracy_df
plot_grid(tree_output$accuracy_df, val_measure = "misclass_rate", tp1 = "maxdepth", tp2 = "thresh")
It is observable that once we reach depth 4, the validation set accuracy is relatively high, with an approximate (and acceptable) 15% misclassification rate, while the short depth retains interpretability.
We then analyse our results by extracting the tuning parameters of the chosen specification, before examining its confusion matrix and ROC curve to gauge the cut-off trade-off.
# Check the predictions of the model with the best trade-off:
best_spec_idx <- 3 # since this is still relatively interpretable.
tree_fit_predictions <- tree_output$preds %>%
filter(spec_no == best_spec_idx)
tree_fit <- tree_output$model_fits[[best_spec_idx]]
best_params <- tree_output$accuracy_df[best_spec_idx,]
best_params %>% glimpse()
## Rows: 1
## Columns: 6
## $ misclass_rate <dbl> 0
## $ cp <dbl> 0.001
## $ maxdepth <dbl> 4
## $ min_n <dbl> 5
## $ thresh <dbl> 0.3
## $ formula <chr> "y ~."
graph_preds(tree_fit_predictions$preds, tree_fit_predictions$y, cm=T, scatter=F)
## [1] "misclass_rate : 0"
tree_fit_predictions %>%
roc_curve(truth = y, prob_preds, event_level = "second") %>%
autoplot()
Note that the classification tree only requires our false positive rate to exceed approximately 15% if we wish to achieve higher than 75% classification accuracy, a smaller sacrifice than the logit required. This means that the predictive accuracy of a tree model (at least within distribution) gives us a more favourable trade-off when diagnosing patients.
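To read this figure off the ROC data rather than the plot, one option (assuming the tree_fit_predictions object above, and treating the 75% figure as a true positive rate) is the following sketch.
# Smallest false positive rate at which the true positive rate reaches 75%
tree_fit_predictions %>%
  roc_curve(truth = y, prob_preds, event_level = "second") %>%
  filter(sensitivity >= 0.75) %>%
  summarise(min_false_positive_rate = min(1 - specificity))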
We now use the best tree model to learn about our data by using the
rpart engine.
# Define the best model specification
best_model_spec <- decision_tree(
mode = "classification",
cost_complexity = best_params$cp, # Cost-complexity pruning
tree_depth = best_params$maxdepth, # Max depth of tree
min_n = best_params$min_n # Minimum number of observations per node
) %>%
set_engine("rpart")
# Create workflow
best_workflow <- workflow() %>%
add_model(best_model_spec) %>%
add_formula(as.formula(best_params$formula))
# Fit the model to full training data
best_tree_fit <- fit(best_workflow, data = Heart_train)
# Extract the fitted model
best_rpart_model <- extract_fit_engine(best_tree_fit) # Gets the rpart model
# Plot the tree
rpart.plot(best_rpart_model, roundint = FALSE)
# roundint = FALSE avoids a warning that rpart cannot retrieve the original training data used to build the model (so it cannot determine roundint and is.binary for the variables)
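The same tree can also be printed as plain-text rules, which reads more like the checklist described earlier. This is a sketch using rpart.plot’s rpart.rules() on the fitted model above; we pass roundint = FALSE for the same reason as in the plot call, on the assumption that the argument is accepted here too.
# Print the splits as plain-text rules (one row per leaf node), closer to a checklist
rpart.rules(best_rpart_model, roundint = FALSE)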
By asking “How would our tree change with access to limited predictors?”, we filter our input space to the most easily measurable medical characteristics to fit on later.
# We now only select a SUBSET of Heart_train which includes only the most measurable data
Heart_subset <- Heart_train %>% select(y, sex, age, chol, trestbps)
We now look for a depth level that gives high prediction accuracy yet keeps the checklist short enough to remain practical and interpretable.
tuning_grid_tree <- expand.grid(
cp = c(0.001), # Cost complexity pruning (alpha parameter that imposes penalty on tree depth)
maxdepth = c(1,3,4,5,6,8,10,15,20), # Limit tree depth
min_n = c(5), # Number of obs per split, so if higher, reduces splits
thresh = c(0.3,0.4, 0.5),
formula = as.character("y ~.")
)
tuning_grid_tree$formula <- as.character(tuning_grid_tree$formula)
We perform a similar analysis to the one above, viewing the grid results and graphing them.
tree_output <- grid_validate_tidy(Heart_subset, Heart_valid, tuning_grid_tree, "tree", "rpart")
# View the accuracy results
tree_output$accuracy_df
plot_grid(tree_output$accuracy_df, val_measure = "misclass_rate", tp1 = "maxdepth", tp2 = "thresh")
It is observable that, with this restricted predictor set, the validation set accuracy is now lower and only improves at higher depths. Moving from a depth of 3 to a depth of 6 makes little additional difference, and the accuracy there is too low.
Instead, a depth 8 tree works quite well. Although this is an 8-point checklist, that is still a manageable number of points, and it requires fewer readings to be taken than the full-predictor tree above. We can show this by comparing a depth 4 tree to a depth 8 tree.
We run the above procedure on a depth 4 tree.
# Check the predictions of the model with the best trade-off:
best_spec_idx <- 21 # since this is still relatively interpretable.
tree_fit_predictions <- tree_output$preds %>%
filter(spec_no == best_spec_idx)
tree_fit <- tree_output$model_fits[[best_spec_idx]]
best_params <- tree_output$accuracy_df[best_spec_idx,]
best_params %>% glimpse()
## Rows: 1
## Columns: 6
## $ misclass_rate <dbl> 0.3311688
## $ cp <dbl> 0.001
## $ maxdepth <dbl> 4
## $ min_n <dbl> 5
## $ thresh <dbl> 0.5
## $ formula <chr> "y ~."
graph_preds(tree_fit_predictions$preds, tree_fit_predictions$y, cm=T, scatter=F)
## [1] "misclass_rate : 0.331168831168831"
tree_fit_predictions %>%
roc_curve(truth = y, prob_preds, event_level = "second") %>%
autoplot()
Because we have access only to limited measurements, we would need to trade off more than 25% in false positive diagnoses to achieve higher than 75% accuracy, which is worse than even the logit model.
However, this assumes we limit ourselves to a depth 4 (i.e., 4-step) diagnosis process. If we increase this to 8, which is still relatively manageable, we find the following.
# Check the predictions of the model with the best trade-off:
best_spec_idx <- 6 # since this is still relatively interpretable.
tree_fit_predictions <- tree_output$preds %>%
filter(spec_no == best_spec_idx)
tree_fit <- tree_output$model_fits[[best_spec_idx]]
best_params <- tree_output$accuracy_df[best_spec_idx,]
best_params %>% glimpse()
## Rows: 1
## Columns: 6
## $ misclass_rate <dbl> 0.237013
## $ cp <dbl> 0.001
## $ maxdepth <dbl> 8
## $ min_n <dbl> 5
## $ thresh <dbl> 0.3
## $ formula <chr> "y ~."
graph_preds(tree_fit_predictions$preds, tree_fit_predictions$y, cm=T, scatter=F)
## [1] "misclass_rate : 0.237012987012987"
tree_fit_predictions %>%
roc_curve(truth = y, prob_preds, event_level = "second") %>%
autoplot()
Here, we can see that if, in the first stage of the checklist, we are willing to have medical professionals spend a little extra time on a slightly lengthier process (8 steps to diagnosis rather than 4), then, in the second stage (the judgement/diagnosis), we do not need to sacrifice as many false positive diagnoses to achieve higher accuracies. Specifically, we only need to falsely diagnose approximately 6% of patients if we wish to diagnose 75% or more of patients accurately.
Earlier, we evaluated how a single tree model could result in a more-desirable trade-off of more false positive diagnoses in return for more true positive diagnoses compared to a logit, but only if either we had access to measurements that are less practical to gather from a patient, or we were willing to extend the depth and complexity of the tree, and hence, the time it would take for a medical professional to carry out a step-based “checklist” diagnostic process.
However, in many environments today (even in emerging economies or in the field), medical professionals have access to computer hardware and software sufficient to run predictive models on new patient data and predict conditional probabilities for patients. This eliminates the need for a step-based “checklist” process, and allows diagnoses to be made systematically for a large number of patients in one go.
Of course, medical professionals still have to gather the data from patients, which means we should still give priority to models that support the practicality of this part of the process. In other words, we will place priority on model specifications that use predictors which are quick and practical to measure.
It therefore makes sense to consider models that may be more complex and less interpretable, but that offer gains in predictive accuracy (with a desirable trade-off) and practicality, making them more applicable for diagnosis (although perhaps less interpretable for prescribing treatment).
For this, we choose to use a random forest model, which has a few predictive advantages:
Low bias, because it uses the predictions of numerous trees that are grown deep and so can fit the complex patterns in the data more closely.
Low variance, because it averages out the predictions of a large number of trees, each fitted on a bootstrapped (i.e., sampled with replacement) subsample of the training dataset.
Low variance, because (unlike bagging, a special case of random forests) it allows splits only on a certain subset of predictors, meaning that each tree is less correlated with the next, and so is less likely to amplify overfitting to our data.
These combine to create a model that has high predictive accuracy on new data, and ideally not just on new data from our same wider dataset (i.e., it can generalise within distribution), but also from new datasets (which may be out of distribution, to the extent that the underlying relationships have changed).
In other words, such a fitting procedure should capture, as accurately as possible, the fundamental or structural relationships between the variables that do not change over time, as opposed to the transitory or reduced-form relationships. This means that the low-bias feature of our flexible model-fitting procedure is particularly useful for this out-of-distribution generalisation.
Using a validation set approach, we can only guarantee the former: that we generalise well within distribution. We aim to avoid too much “optimism” (which would encourage our empirical risk to consistently underestimate the true risk) by avoiding excessive degrees of freedom, while still using sufficient predictors to reduce the irreducible error (we want a true function of these predictors to be as good at predicting heart disease as possible).
We do this by choosing the tuning parameters and model specification to maximise validation (not training) set accuracy. In other words, we attempt to capture the underlying relationships in our wider data as far as possible without misinterpreting the random noise or idiosyncrasies of the training subset as signals of the underlying relationship.
As for the latter, we argue that because the human body works in much the same way today as it did many decades ago (NHS, 2025; National Institutes of Health, 2025), much of the relationship we capture holds true over time. Although today’s distribution and data generating process differ somewhat from the time when the data were gathered, they remain similar to the distribution of our dataset, given that the structural biological relationships still hold.
However, the distribution and data generating process for a population in a different geographical area, with a correspondingly different hereditary background, is more likely to be fundamentally different to the narrow population in our dataset at any given point in time, because hereditary (gene-related) characteristics are more likely to affect the relationships, and to do so more significantly and systematically (British Heart Foundation, 2025). To the extent that these differences have a minor aggregate effect, the distributions of our data and such an outside sample remain similar.
In particular, we argue that moving from our Cleveland sample to the wider US may introduce more variation in hereditary factors that slightly alter the idiosyncratic relationships, and thus increase irreducible error (our model does not account for these factors, so it will simply mis-estimate a little more). However, these factors should not differ systematically from those in our Cleveland sample, so our estimate of the data generating process should not be systematically wrong or biased.
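Before tuning, the decorrelation mechanism described above can be made concrete. This is a minimal sketch of our own, outside the tuned pipeline below; it assumes Heart_train and n_preds from earlier chunks and mirrors the randomForest() call used later.
set.seed(1)
# Bagging: every predictor is a candidate at every split (mtry = number of predictors)
bag_fit <- randomForest(y ~ ., data = Heart_train, ntree = 500, mtry = n_preds)
# Random forest: only a random subset of predictors is considered at each split
rf_sketch <- randomForest(y ~ ., data = Heart_train, ntree = 500, mtry = floor(sqrt(n_preds)))
# Compare out-of-bag error rates after all 500 trees
bag_fit$err.rate[500, "OOB"]
rf_sketch$err.rate[500, "OOB"]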
We first create a tuning grid for the random forest, then view the grid and its plot before analysing the results.
tuning_grid_rf <- expand.grid(
mtry = 1:6,
trees = c(500),
thresh = c(0.4, 0.5),
formula = as.character(formulas[3])
)
tuning_grid_rf$formula = as.character(tuning_grid_rf$formula)
tuning_grid_rf
# Run the grid validation function
rf_output <- grid_validate_tidy(Heart_train, Heart_valid, tuning_grid_rf, "random_forest", "randomForest")
rf_output$accuracy_df
plot_grid(rf_output$accuracy_df, val_measure = "misclass_rate", tp1 = "mtry", tp2 = "thresh")
# Check the predictions of the model with the lowest misclass rate
best_spec_idx <-which.min(rf_output$accuracy_df$misclass_rate)
rf_fit_predictions <- rf_output$preds %>%
filter(spec_no == best_spec_idx)
We similarly look at the confusion matrix and the ROC curve to gauge the cut-off trade-off.
graph_preds(rf_fit_predictions$preds, rf_fit_predictions$y, cm=T, scatter=F)
## [1] "misclass_rate : 0.0194805194805195"
rf_fit_predictions %>%
roc_curve(truth = y, prob_preds, event_level = "second") %>%
autoplot()
Because a predicted class can only be 100% or 0% accurate for a given observation (the outcome takes a value of 0 or 1), it is entirely possible to classify every observation in a validation set into its true class even if the continuous predicted conditional probabilities differ slightly from their true values, especially if there are not too many observations and the data generating process makes patients in the positive class quite distinguishable from those in the negative class.
Thus, the fact that there is no need to sacrifice false positives for diagnostic accuracy means that this model class provides a highly desirable, and genuine, trade-off.
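As a tiny made-up illustration of this point (the numbers are hypothetical, not drawn from our fits):
y_obs <- c(1, 1, 0, 0)                # hypothetical observed classes
p_hat <- c(0.70, 0.95, 0.30, 0.25)    # imperfect estimated conditional probabilities
all(as.integer(p_hat > 0.5) == y_obs) # TRUE: 0% misclassification despite imperfect probabilities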
We now use the best model to learn about our data. We first identify the “best row” based on model accuracy. Then we find variable importance.
best_spec_row <- rf_output$accuracy_df[best_spec_idx,]
best_spec_row
rf_fit <- randomForest(y ~ ., data = Heart_train, ntree = best_spec_row$trees, mtry = best_spec_row$mtry, importance = TRUE)
varImpPlot(rf_fit)
Although there are no splits or coefficients to use to interpret the predictive contribution of each variable, we can still use less interpretable, more abstract measures of variable importance.
The above charts show that the “most important” variables, in terms of their contribution towards marginally increasing the strength of fit of the random forest model, include oldpeak (ST depression induced by exercise relative to rest), thalach (maximum heart rate achieved), age, chol (blood cholesterol level), and trestbps (resting blood pressure).
This makes intuitive sense, since being unaccustomed to exercise, having high cholesterol and blood pressure, and simply being older are commonly understood to cause heart disease, so measures of these should be good predictors. These plots do not validate any causal effect, but they do inform prediction and diagnosis.
The last three are practical measures that are easy to collect in a relatively short space of time, motivating the benefits of this model.
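The same importance measures can also be read as a table rather than a plot, using randomForest’s importance() on the rf_fit object above (the column names are those randomForest reports when importance = TRUE).
# Tabular view of variable importance, sorted by mean decrease in accuracy
imp <- importance(rf_fit)
imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]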
We now gauge the functional form of the conditional probability using Partial Dependence Plots (PDPs).
pred_rf <- Predictor$new(rf_fit)
pdp_rf <- FeatureEffects$new(pred_rf, features = c("oldpeak","thalach", "age", "chol"), method = "pdp+ice")
plot(pdp_rf)
Unfortunately, each individual predictor appears to have only a small marginal/partial effect on the probability of having heart disease. Instead, conditional (joint) effects may be more important.
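One way to start probing such joint effects, assuming the iml objects defined above, is a two-feature partial dependence surface; we treat iml’s support for a pair of features in FeatureEffect as an assumption here rather than a verified result.
# Joint partial dependence of the predicted probability on oldpeak and thalach
pdp_pair <- FeatureEffect$new(pred_rf, feature = c("oldpeak", "thalach"), method = "pdp")
plot(pdp_pair)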
To find the best \(n\)-variable random forest, we follow a similar procedure, but using mixed subset selection, and then examine the plots.
tuning_grid_rf_step <- expand.grid(
mtry = 1:6,
trees = c(500),
thresh = c(0.5),
formula = as.character("")
)
tuning_grid_rf_step$formula = as.character(tuning_grid_rf_step$formula)
tuning_grid_rf_step
stepwise_selection_rf_output <- stepwise_selection(Heart_train, Heart_valid, tuning_grid_rf_step[1,], "random_forest", "randomForest", predictors_lim = 4)
stepwise_rf_df <- stepwise_selection_rf_output$accuracy_df # Table with steps
stepwise_rf_df
plot_grid(stepwise_rf_df, val_measure = "misclass_rate", tp1 = "total_steps", tp2="formula", tp3="n")
We find that the second two-variable random forest model (three forward steps, two backward steps, and then another forward step) works quite well, with chol and thalach being the most accurate. Having selected the best model, we now extract this specification, examine the tuning parameters, look at the confusion matrix, and graph the ROC curve as before.
# Check the predictions of the model with the best trade-off:
best_spec_idx <- 6 # since this is still relatively interpretable.
stepwise_rf_fit_predictions <- stepwise_selection_rf_output$preds %>% filter(total_steps == best_spec_idx, best==1)
stepwise_rf_fit <- stepwise_selection_rf_output$model_fits[[best_spec_idx]]
best_params <- stepwise_rf_df[best_spec_idx,]
best_params %>% glimpse()
## Rows: 1
## Columns: 9
## $ misclass_rate <dbl> 0
## $ formula <chr> "y ~ target + age"
## $ total_steps <dbl> 6
## $ total_fwd <dbl> 4
## $ total_bkwd <dbl> 2
## $ n <dbl> 2
## $ mtry <int> 1
## $ trees <dbl> 500
## $ thresh <dbl> 0.5
graph_preds(stepwise_rf_fit_predictions$preds, stepwise_rf_fit_predictions$y, cm=T, scatter=F)
## [1] "misclass_rate : 0"
stepwise_rf_fit_predictions %>%
roc_curve(truth = y, prob_preds, event_level = "second") %>%
autoplot()
Unlike with the logit, we don’t face nearly as steep a trade-off with false positives. Achieving over 75% accuracy requires only an approximately 3% false positive rate.
Moreover, this model can be used to conduct highly practical diagnoses, since it only requires taking two pieces of data from the patient that can each be accurately collected in a short space of time.
Bibliography
British Heart Foundation. (2025, March). Family History of Heart and Circulatory Diseases (BHF). Retrieved from British Heart Foundation: https://www.google.com/search?q=have+the+same+factors+always+caused+heart+disease&newwindow=1&sca_esv=0472b01422b69ec8&sxsrf=AHTn8zr-98IX2GHtRBseo7vSVFNLi-K4ig%3A1744452819588&ei=0zz6Z5vRI7qDhbIPyoWukAo&ved=0ahUKEwjbho6VodKMAxW6QUEAHcqCC6IQ4dUDCBE&uact=5&
British Heart Foundation. (2025, April 12). Medicines for Heart Conditions - Treatments. Retrieved from British Heart Foundation: http://bhf.org.uk/informationsupport/treatments/medication#:~:text=Holidays%20and%20travel-,Types%20of%20medicine%20for%20heart%20conditions,SGLT2%20inhibitors.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning with Applications in R.
Mayo Clinic. (2025, March). Heart disease - Diagnosis and treatment - Mayo Clinic. Retrieved from Mayo Clinic: https://www.mayoclinic.org/diseases-conditions/heart-disease/diagnosis-treatment/drc-20353124
National Institutes of Health. (2025, March). Bioengineering and the Cardiovascular System - PMC. Retrieved from NIH.org: https://www.google.com/search?q=have+the+biological+fundamentals+of+the+human+body+with+respect+to+heart+disease+changed+in+the+last+50+years&newwindow=1&sca_esv=0472b01422b69ec8&sxsrf=AHTn8zpnxWNSxCZU6jBsjJANayje5PMEag%3A1744457102261&ei=jk36Z8baD62ahbIP
NHS. (2025, March). Coronary Heart Disease - Causes. Retrieved from NHS: https://www.nhs.uk/conditions/coronary-heart-disease/causes/#:~:text=Coronary%20heart%20disease%20(CHD)%20is,have%20diabetes
NHS. (2025, April). Coronary heart disease - Treatment. Retrieved from NHS: https://www.mayoclinic.org/diseases-conditions/heart-disease/diagnosis-treatment/drc-20353124